Draft: Migrate to OpenTelemetry tracing #13400
Conversation
pyproject.toml (Outdated)
```toml
url_preview = ["lxml"]
sentry = ["sentry-sdk"]
opentracing = ["jaeger-client", "opentracing"]
opentelemetry = ["opentelemetry-api", "opentelemetry-sdk"]
```
I don't know what I am doing with the dependencies here. Please double check this stuff.
```
Invalid type StreamToken for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
```
Had to add a few more logs to find this instance, since the warning doesn't give much info about where the invalid attribute is being set. This was good enough to find it in the code:
```
BoundedAttributes __setitem__ key=since_token value=StreamToken(room_key=RoomStreamToken(topological=None, stream=1787, instance_map=frozendict.frozendict({})), presence_key=481272, typing_key=0, receipt_key=340, account_data_key=1233, push_rules_key=8, to_device_key=57, device_list_key=199, groups_key=0)
BoundedAttributes __setitem__ key=now_token value=StreamToken(room_key=RoomStreamToken(topological=None, stream=1787, instance_map=frozendict.frozendict({})), presence_key=481287, typing_key=0, receipt_key=340, account_data_key=1233, push_rules_key=8, to_device_key=57, device_list_key=199, groups_key=0)
BoundedAttributes __setitem__ key=token value=StreamToken(room_key=RoomStreamToken(topological=None, stream=1787, instance_map=frozendict.frozendict({})), presence_key=481291, typing_key=0, receipt_key=340, account_data_key=1237, push_rules_key=8, to_device_key=57, device_list_key=199, groups_key=0)
```
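For reference, a minimal sketch (not the PR's actual fix) of coercing values before they reach the span attributes, since OTel only accepts bool/str/bytes/int/float (or sequences of those):

```python
from typing import Any


def to_attribute_value(value: Any) -> Any:
    """Coerce a value into something OpenTelemetry accepts as an attribute value.

    Non-primitive values (e.g. a StreamToken) get stringified instead of being
    rejected with the "Invalid type ... for attribute value" warning above.
    """
    if isinstance(value, (bool, str, bytes, int, float)):
        return value
    return str(value)


# span.set_attribute("since_token", to_attribute_value(since_token))
```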
Fix error:
```
AttributeError: 'SpanContext' object has no attribute 'get'
```
`Context`:
```
{'current-span-1a226c96-a5db-4412-bcaa-1fdd34213c5c': _Span(name="sendToDevice", context=SpanContext(trace_id=0x5d2dcc3fdc8205046d60a5cd18672ac6, span_id=0x715c736ff5f4d208, trace_flags=0x01, trace_state=[], is_remote=False))}
```
`SpanContext`:
```
SpanContext(trace_id=0xf7cd9d058b7b76f364bdd649c4ba7b8a, span_id=0x287ce71bac31bfc4, trace_flags=0x01, trace_state=[], is_remote=False)
```
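The `.get()` failure looks like a `SpanContext` being passed where OTel expects a `Context` (a dict-like mapping, as shown above). A minimal sketch of the standard `opentelemetry-api` way to turn a remote `SpanContext` into a usable parent `Context`, using the values from the log above (not necessarily the exact fix in this PR):

```python
from opentelemetry import trace
from opentelemetry.trace import NonRecordingSpan, SpanContext, TraceFlags

# The remote span context we were handed (marked is_remote=True here since it
# conceptually came from another homeserver).
remote_span_context = SpanContext(
    trace_id=0xF7CD9D058B7B76F364BDD649C4BA7B8A,
    span_id=0x287CE71BAC31BFC4,
    is_remote=True,
    trace_flags=TraceFlags(0x01),
)

# Wrap it in a NonRecordingSpan and stash that in a Context -- this is what
# APIs taking a `context=` argument actually want (hence the `.get()` calls).
parent_context = trace.set_span_in_context(NonRecordingSpan(remote_span_context))

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("sendToDevice", context=parent_context):
    ...
```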
The `incoming-federation-request` vs `process-federation_request` split was first introduced in #11870 (a rough OTel mapping is sketched after this list):

- Span for remote trace: `incoming-federation-request`
  - `child_of` reference: `origin_span_context`
  - `follows_from` reference: `servlet_span`
- Span for local trace: `process-federation-request`
  - `child_of` reference: `servlet_span` (by the nature of it being active)
  - `follows_from` reference: `incoming-federation-request`
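A rough sketch of how those references translate to `opentelemetry-api` primitives: `child_of` becomes the parent passed via `context=` (or inherited from whatever span is active), while `follows_from` becomes a `Link` to the other span's `SpanContext`. The stand-ins below are illustrative, not Synapse's actual objects:

```python
from opentelemetry import trace
from opentelemetry.trace import INVALID_SPAN_CONTEXT, Link, NonRecordingSpan

tracer = trace.get_tracer(__name__)

# Stand-ins for the objects named in the list above (illustrative only).
origin_span_context = INVALID_SPAN_CONTEXT  # the remote caller's span context
servlet_span = tracer.start_span("servlet")  # the local servlet span

# Remote trace: parented on the origin's span context, linked to the servlet span
# (child_of -> `context=`, follows_from -> `links=`).
incoming_span = tracer.start_span(
    "incoming-federation-request",
    context=trace.set_span_in_context(NonRecordingSpan(origin_span_context)),
    links=[Link(servlet_span.get_span_context())],
)

# Local trace: parented on the active servlet span, linked back to the remote one.
with trace.use_span(servlet_span, end_on_exit=True):
    with tracer.start_as_current_span(
        "process-federation-request",
        links=[Link(incoming_span.get_span_context())],
    ):
        ...  # handle the request

incoming_span.end()
```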
```diff
 try:
-    with opentracing.start_active_span(
+    with tracing.start_active_span(
         "db.txn",
```
When viewing a trace for a servlet, we have the `db.{desc}` span, but the child database spans like `db.connection`, `db.txn`, and `db.query` are missing 🤔
Probably need to assert that the parents are set correctly in the tests as well (particularly with deferreds).
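One hedged way to assert the parentage in tests, using the OTel SDK's in-memory exporter (the span names here are illustrative, not Synapse's actual test helpers):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer(__name__)

with tracer.start_as_current_span("db.interaction"):
    with tracer.start_as_current_span("db.connection"):
        with tracer.start_as_current_span("db.txn"):
            pass

# Spans finish innermost-first.
txn, conn, interaction = exporter.get_finished_spans()
assert txn.parent.span_id == conn.get_span_context().span_id
assert conn.parent.span_id == interaction.get_span_context().span_id
assert interaction.parent is None
```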
Conflicts: `synapse/logging/opentracing.py`, `tests/logging/test_opentracing.py`
Conflicts: `poetry.lock`, `synapse/federation/federation_client.py`, `synapse/federation/federation_server.py`, `synapse/handlers/federation.py`, `synapse/handlers/federation_event.py`, `synapse/logging/opentracing.py`, `synapse/rest/client/room.py`, `synapse/storage/controllers/persist_events.py`, `synapse/storage/controllers/state.py`
Conflicts: `poetry.lock`, `synapse/api/auth.py`, `synapse/federation/federation_client.py`, `synapse/logging/opentracing.py`, `synapse/rest/client/keys.py`, `synapse/rest/client/sendtodevice.py`, `synapse/storage/schema/__init__.py`
Conflicts: `synapse/storage/schema/__init__.py`
```
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 306, in _async_render_wrapper
callback_return = await self._async_render(request)
File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 512, in _async_render
callback_return = await raw_callback_return
File "/usr/local/lib/python3.9/site-packages/synapse/federation/transport/server/_base.py", line 357, in new_func
remote_parent_span = create_non_recording_span()
File "/usr/local/lib/python3.9/site-packages/synapse/logging/tracing.py", line 502, in create_non_recording_span
return opentelemetry.trace.NonRecordingSpan(
AttributeError: 'NoneType' object has no attribute 'trace'
```
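A possible guard for this, assuming `opentelemetry` is an optional import that ends up as `None` when the `opentelemetry` extra isn't installed (the `INVALID_SPAN_CONTEXT` argument is an assumption, not necessarily what this PR passes):

```python
from typing import Optional

try:
    import opentelemetry
    import opentelemetry.trace
except ImportError:
    opentelemetry = None  # type: ignore[assignment]


def create_non_recording_span() -> Optional["opentelemetry.trace.NonRecordingSpan"]:
    """Create a no-op span that doesn't record or become part of a recorded trace."""
    if opentelemetry is None:
        # Tracing is disabled; callers have to cope with getting no span back.
        return None

    return opentelemetry.trace.NonRecordingSpan(
        opentelemetry.trace.INVALID_SPAN_CONTEXT
    )
```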
Hopefully fix:
```
File "/home/runner/work/synapse/synapse/synapse/storage/controllers/persist_events.py", line 246, in add_to_queue
links=[Link(end_item.tracing_span_context)],
builtins.TypeError: __init__() takes 1 positional argument but 2 were given
```
Conflicts: `synapse/storage/schema/__init__.py`
Conflicts: `.github/workflows/tests.yml`, `poetry.lock`, `synapse/storage/schema/__init__.py`
Conflicts: `synapse/handlers/message.py`, `synapse/logging/opentracing.py`
Conflicts: `poetry.lock` (conflicts not fixed), `synapse/handlers/message.py`, `synapse/handlers/relations.py`, `synapse/storage/databases/main/devices.py`, `synapse/storage/schema/__init__.py`
Conflicts: `docs/usage/configuration/config_documentation.md`, `poetry.lock`, `synapse/handlers/message.py`, `synapse/http/server.py`, `synapse/logging/opentracing.py`, `synapse/rest/client/keys.py`, `synapse/rest/client/room.py`
Conflicts: `poetry.lock`
```python
span = start_span(
    name=name,
    context=context,
    kind=kind,
    attributes=attributes,
    links=links,
    start_time=start_time,
    record_exception=record_exception,
    set_status_on_exception=set_status_on_exception,
    tracer=tracer,
)

# Equivalent to `tracer.start_as_current_span`
return opentelemetry.trace.use_span(
    span,
    end_on_exit=end_on_exit,
    record_exception=record_exception,
    set_status_on_exception=set_status_on_exception,
)
```
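For context, a hedged usage sketch of the wrapper above, assuming it is exposed as `start_active_span` (as in the `db.txn` snippet earlier): the return value is a context manager that makes the span current, and ends it on exit when `end_on_exit` is left at its default.

```python
with start_active_span("db.txn") as span:
    # The span is the active/current span inside this block.
    span.set_attribute("db.statement", "SELECT 1")
# On exit the span is ended (and any exception recorded, per the flags above).
```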
Moving this over from #13440 (comment):
Occasionally, I see spans that have a super massive duration and show up as 213503982d 8h in the Jaeger UI. It obviously looks like some sort of max-duration bug.
The raw data for those spans is below. Only these spans are messed up, and the end_time does not affect any further parent spans above `_process_pulled_event`.
db.get_partial_state_events

```json
{
"traceID": "563478e5d75db03e9fc028822fde0649",
"spanID": "5b63b0172763c5cf",
"flags": 1,
"operationName": "db.get_partial_state_events",
"references": [
{
"refType": "CHILD_OF",
"traceID": "563478e5d75db03e9fc028822fde0649",
"spanID": "10c9f2a2c50e286f"
}
],
"startTime": 1659994809436162,
"duration": 18446744073709517944,
"tags": [
// ...
],
"logs": [],
"processID": "p1",
"warnings": null
}
```

_process_pulled_event

```json
{
"traceID": "563478e5d75db03e9fc028822fde0649",
"spanID": "6e9ee2608b03b542",
"flags": 1,
"operationName": "_process_pulled_event",
"references": [
{
"refType": "CHILD_OF",
"traceID": "563478e5d75db03e9fc028822fde0649",
"spanID": "af2c9c8d08284134"
}
],
"startTime": 1659994809432356,
"duration": 18446744073709532019,
"tags": [
// ...
],
"logs": [],
"processID": "p1",
"warnings": null
}
```
Definitely close to the maximum unsigned 64-bit value: `sys.maxsize * 2 + 1` (i.e. 2^64 - 1) -> 18446744073709551615 (via https://stackoverflow.com/questions/7604966/maximum-and-minimum-values-for-ints)
It seems like it's some negative value getting turned into an unsigned integer so it wraps to the end of the value range. But I don't see the math where that can happen yet.
`start_time` and `end_time` are calculated by `time.time_ns()` (`_time_ns`).
And the Jaeger exporter calculates `startTime` and `duration` with its small `_nsec_to_usec_round` function.
To see whether it's just an "`endTime` (0) - `startTime` (some positive number) = some negative number" type of problem: if we take the max value minus the duration we see for `_process_pulled_event`, 18446744073709551615 - 18446744073709532019 -> 19596 microseconds. TODO: Is that reasonable?
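A quick check of the wraparound hypothesis with the numbers above (illustrative only, not the actual fix):

```python
# A small negative duration, stored into an unsigned 64-bit field, wraps to
# just under 2**64 -- which is exactly the size of the durations above.
observed = 18446744073709532019          # duration reported for _process_pulled_event
negative_duration = observed - 2**64     # what it would be as a signed value
assert negative_duration == -19597
assert negative_duration % 2**64 == observed  # wrapping it back reproduces the value
```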
Closing as our existing OpenTracing setup seems to just work over UDP inside Docker (for Complement tests) on my Linux machine now, so I don't have a personal need for this anymore (it didn't work properly on macOS). This is probably best tackled with a fresh PR with some of the learnings from here anyway.
Migrate to OpenTelemetry tracing
Fix #11850
Blockers
- `Twisted` release that includes the `contextvars` fixes: `twisted==22.8.0` (2022-09-06) was released 🚀 - need to try it out!
- `jaeger-client` to `opentelemetry` (#11850 (comment))
- `force_tracing_for_users`

Dev notes
via https://stackoverflow.com/a/70064450/796832
Typing
Type hints for optional imports:
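A hedged sketch of the usual pattern from that Stack Overflow answer, applied to `opentelemetry` (the helper here is illustrative, not necessarily what the PR does):

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Only imported for type checking, so mypy knows the types even when the
    # optional dependency isn't installed at runtime.
    import opentelemetry.trace

try:
    import opentelemetry.trace as otel_trace
except ImportError:
    otel_trace = None  # type: ignore[assignment]


def get_active_span() -> Optional["opentelemetry.trace.Span"]:
    if otel_trace is None:
        return None
    return otel_trace.get_current_span()
```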
OpenTelemetry
- `tags.py`
- `child_of` -> supplying the `context` on an OTEL span
- `follows_from` references -> `links` on an OTEL span
- `follows_from` vs `child_of`? See https://opentracing.io/specification/

Code:
- `synapse/logging/opentracing.py`
- `synapse/config/tracer.py`

OTEL code:
- `SpanAttributes` (`opentelemetry-semantic-conventions/src/opentelemetry/semconv/trace/__init__.py`)

Sampling:
- opentracing: https://github.com/jaegertracing/jaeger-client-python#debug-traces-forced-sampling
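For comparison, a hedged sketch of what sampler configuration looks like on the OTel SDK side (not necessarily what this PR wires up):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ParentBased, TraceIdRatioBased

# Respect the remote/parent sampling decision, otherwise sample 10% of traces.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))

# Closest analogue to Jaeger's "debug traces (forced sampling)": sample everything.
debug_provider = TracerProvider(sampler=ALWAYS_ON)
```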
UDP on macOS (maybe Docker)

Error when using `jaeger.thrift` over Thrift-compact protocol with UDP on port `6831` (`homeserver.yaml`):

Fix: open-telemetry/opentelemetry-python#1061 (comment) -> https://www.jaegertracing.io/docs/1.19/client-libraries/#emsgsize-and-udp-buffer-limits -> jaegertracing/jaeger-client-node#124
To make those changes stick between OS reboots, make sure to add them to `/etc/sysctl.conf` as well.

You also have to set `udp_split_oversized_batches` to split the batches up if they go over the 65k limit set in OTEL (`homeserver.yaml`).

Using thrift over UDP (port `6831`) to communicate with the Jaeger agent doesn't seem to work from within a Complement test Docker container. I wonder why this is the case? I was seeing this same behavior with the Jaeger `opentracing` stuff. Is the UDP connection being over-saturated (1065 spans in one trace)? Can the Jaeger agent in Docker not keep up? We see some spans come over, but never the main overarching servlet one, which is probably the last to be exported. Using the HTTP Jaeger collector endpoint instead seems to work fine for getting the whole trace (`collector_endpoint: "http://localhost:14268/api/traces?format=jaeger.thrift"`).
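A hedged sketch of the two transports discussed here, via the `opentelemetry-exporter-jaeger-thrift` exporter (constructor arguments assumed from that package, not from this PR's config plumbing):

```python
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Over UDP to the agent, splitting batches that would exceed the ~65k datagram limit.
udp_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
    udp_split_oversized_batches=True,
)

# Or over HTTP straight to the collector, which is what ended up working for
# getting whole traces out of the Complement Docker containers.
http_exporter = JaegerExporter(
    collector_endpoint="http://localhost:14268/api/traces?format=jaeger.thrift",
)
```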
TODO

- `synapse/logging/scopecontextmanager.py`
- `start_active_span_follows_from`
- `start_active_span_from_edu`
- `synapse/federation/transport/server/_base.py` (search for `scope`)
- `docs/tracing.md`
- `Failed to detach context` errors (`ValueError: <Token var=<ContextVar name='current_context' default={} at 0x111b80cc0> at 0x111b98880> was created in a different Context`)
- `force_tracing_for_users`

Pull Request Checklist
- Pull request includes a changelog file. The entry should describe the change for users rather than something like "Moved X method from `EventStore` to `EventWorkerStore`.", and use markdown where necessary, mostly for `code blocks`.
- Pull request includes a sign off
- Code style is correct (run the linters)